Conditional GWAS
Overview
Models
Here we present the results of a genome-wide association study (GWAS) conducted using pyseer (John A. Lees et al. 2018), a tool for identifying genetic associations with complex traits. The GWAS was performed using the linear mixed effects model such that the minimum inhibitory concentration (MIC) of Penicillin G: \[\mathbf{y} \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\]
with
\[ \begin{align*} \mathbf{u} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{G} \right), \quad &\mathbf{\epsilon} \sim \mathcal{N} \left( 0, \sigma^2_g \mathbf{I} \right) \end{align*} \]
Where \(\mathbf{G}\) is the similarity matrix that was obtained from the phylogenetic relationship of the samples, and \(\mathbf{I}\) is the identity matrix.
| Symbol | Meaning |
|---|---|
| 𝑦 | A vector containing the MIC data for each sample. |
| 𝑊 | A design matrix containing the conditional covariates: We used categorical variables for the variants of pencillin binding proteins pbp1a, pbp2b and pbp2x, coded according to the allele scheme proposed in (Y. Li et al. 2016), and benchmarked in (Y. Li et al. 2017). The allele codes were obtained by running the Pathogenwatch pipeline. We also included the study the data was obtained as a categorical variable, to capture differences in MIC measurement protocols. |
| 𝑎 | Fixed effects for the covariates. |
| 𝑋 | A design matrix containing the presence or absence of the unitigs. |
| 𝑏 | Fixed effects of the unitigs. |
| 𝐾 | The similarity matrix between all pairs of samples. We used a neighbor-joining phylogeny estimated using PopPUNK (John A. Lees et al. 2019). |
| 𝑢 | Random effects for each row of the kinship matrix. |
| ε | Normally distributed error terms. |
Thus we assume that the MIC can be predicted by the covariates with the correlation structure implied by the phylogenetic relationship of the samples. If we take the expectation of the equation we obtain \[E(\mathbf{y}) \sim \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon\] \[ = E(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ = \mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b}\] and variance \[V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon)\] \[ \begin{aligned} V(\mathbf{y}) = V(\mathbf{W} \mathbf{a} + \mathbf{X} \mathbf{b} + \mathbf{K} \mathbf{u} +\epsilon) \\ = V(\mathbf{K} \mathbf{u} + \epsilon) \\ = V(\mathbf{K} \mathbf{u}) + V(\epsilon) \\ = \sigma^2_g \mathbf{G} + \sigma^2_g \mathbf{I} \end{aligned} \]
Filtering of unitigs and defining significance threshold
We filtered the unitigs based on a minimum allele frequency (MAF) between [0.01, 0.99] meaning that the unitigs have to be present in more than 1% and less than 99% of the isolates. We used the number of patterns in the unitigs to define a significance threshold of the p-values of the unitigs. We detected 974212 unique patterns translating to a significance threshold of \(p = 5.13\cdot 10^{-08}\). All downstream results and visualizations were reported for unitigs with a smaller p-value than the threshold. Many of these hits were then annotated using the nomenclature of Bakta (Schwengers et al. 2021). The annotations can be looked up on NCBI for further information.
Results
Significant hits ordered by average effect size

Significant hits ordered by minor allele frequency (MAF)

Interactive plot of unitigs mapped to genes
In the figures below the unitigs are mapped to reference genomes and. As references we used multiple genomes, and additionally we annotated our
Summary of annotated regions matching the unitigs (Table)
Unitig sequences, significance, effect size and heritability (Table)
Unitigs mapped to reference genomes
In total we extracted 2527 significant unitigs. We mapped these unitigs to different reference genomes, summarized in the table below. Streptococcus pneumoniae has a large pan-genome, and none of the references contain all of the unitigs. Note that a high fraction of the unitigs were found in ST556, which represents a sequence from a multidrug-resistant isolate from an otitis media patient (G. Li et al. 2012).
| Reference strain | Description |
|---|---|
| ATCC70066 | Multidrug-Resistant Pandemic Clone Streptococcus pneumoniae. 885 out of 2527 unitigs were mapped to this genome. https://doi.org/10.1128/JB.01343-08 |
| Hu17 | high-level beta-lactam resistance and penicillin sensitive phenotype. 719 out of 2527 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/data-hub/genome/GCF_002076835.1/ |
| EF3030 | A serotype 19F isolate that colonizes the nasopharynx of mice while being mostly noninvasive. 1359 out of 2527 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6509521/ |
| ST556 | A Multidrug-Resistant Isolate from an Otitis Media Patient. 2147 out of 2527 unitigs were mapped to this genome. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3370858/ |
| TIGR4 | From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 774 out of 2527 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/ |
| SN75752 | From the paper: The acquisition of clinically relevant amoxicillin resistance in Streptococcus pneumoniae requires ordered horizontal gene transfer of four loci. 748 out of 2527 unitigs were mapped to this genome. https://pubmed.ncbi.nlm.nih.gov/35877768/ |